Abstract

Graph simulation (using graph schemata or data guides) has been successfully
proposed as a technique for adding structure to semistructured data. Design
patterns for description (such as meta-classes and homomorphisms between
schema layers), which are prominent in the object-oriented programming
community, constitute a generalization of this graph simulation approach.

In this paper, we show that description is applicable to a wide range of data
models that have some notion of object (identity), and propose to turn it into
a data model primitive much like, say, inheritance. We argue that such an
extension fills a practical need in contemporary data management. We then
present algebraic techniques for query optimization (using the notions of
described and description queries). Finally, in the semistructured setting, we
discuss the pruning of regular path queries (with nested conditions) using
description meta-data. In this context, our notion of meta-data extends graph
schemata and data guides by meta-level values, which makes it possible to boost
query performance and to reduce the redundancy of data.



1 Introduction 

Research on the management of meta-data is an important issue in the database 
context. It is relevant to a wide variety of data management problems, including 
aspects of query optimization and physical storage management. 

There are two major directions regarding meta-data in the data management
context. The first one, which has received the greater share of attention,
equates meta-data with database schemata and provides tools and formalisms in
which meta-data are closely integrated with the reasoning facilities. This view
has resulted in a large body of research (e.g. higher-order logics like F-Logic
[?] and HiLog [?], extended query languages such as SchemaSQL and MD-SQL [?],
and systems such as ConceptBase [16]).

An alternative view, that of meta-data as data that describe data, has received
much coverage in the field of object-oriented software engineering [29], [?]
(with work on meta-models and meta-classes, and design patterns with
description semantics) but has seen only rudimentary formal treatment (e.g.
[12]) and apparently no attention in the database arena. In this paper, we
focus on this second view of meta-data, as data explicitly stored in a
database, featuring special description relationships with distinct
"instance-level" data. The description semantics can be profitably applied to
problems such as semantic query optimization [14], [?], [21], and deciding the
containment or equivalence of queries under meta-data interpreted as
constraints (e.g., using the Chase [?], [25]), problems that are of wide
practical interest.

Our argumentation builds on two simple concepts, 

• the description pattern between classes and meta-classes (where each
instance of a meta-class, that is, a meta-object, describes a certain category
of objects, instances of classes), and

• the homomorphism pattern [29] between relationships of classes and
relationships of meta-classes, entailing that for each link between objects
that belongs to a relationship under a homomorphism on the instance level, a
describing link on the meta-level must exist.

Although many of the terms used in this paper are borrowed from object- 
oriented data models, we argue that the results presented apply to many real- 
world databases using different data models (such as semantic relational and 
semistructured) as well. 

Example 1.1 Consider the case of a hypothetical database of a customer-centric 
car company. The company takes pride in providing each customer with a thor- 
oughly customized product. This is achieved using a production management 
information system that provides flexibility on two levels. 

• Firstly, the company follows an approach of offering a more fine-grained 
granularity of choices than a mere fixed number of car models in its prod- 
uct line. Rather, a number of platforms can be combined with a choice of 
engine models, dimensions of wheels and tires, body shapes, and interior 
design. Thus, the customer is offered a number of models of these con- 
stituent components, together with a compatibility matrix concerning their 
design specifications.¹

• The second degree of freedom is achieved by parameters that can be selected 
for each of the components, for instance the color of the car body or the 
materials to be used for furnishing the car's interior. 



¹ Certainly, the customer needs help from a car dealer or software assistant on
the Web when ordering his or her dream car.

Figure 1: UML schema of Example 1.1.

The assembly status of each order can be tracked by the customer via the
company's Web site. For traceability, information about assembly lines,
operators, and parts from suppliers that are involved in the production of each
order is recorded in the information system as assembly proceeds.

A simplified object-oriented database schema of our hypothetical information
system is shown in Figure 1. This core schema covers the assembly trees of cars
as-built (via the class Part and the association part), customer-order
relationships, and the parameters chosen for the particular order (using the
class Property). Car models and abstract components of cars as-designed are
stored as meta-objects using the meta-classes Part' and Property'. For each
object of the classes Part and Property, there is exactly one meta-object,
reachable through its meta link. To put a maximum amount of abstract
design-level information into this meta-level, the associations Part'.part' and
Part'.prop' link meta-objects and simulate the associations Part.part and
Part.prop, respectively, in the following natural manner: whenever two objects
are linked on the instance level, their corresponding meta-objects are linked
on the meta-level. For a simple database instance coherent with our schema, see
Figure 2. This figure shows a customer order for a red sports utility vehicle
(SUV), together with the meta-data describing the car model.

The layered schema design of Figure 1 has several strong points, including a
neat separation of design-level and construction-level information, and the avoid- 
ance of redundancy. Note that description data are not automatically generated 
in some way by the system, but have been carefully acquired by our car company 
and are amongst its most valuable "treasures". 

Figure 2: An example car order (bottom) and corresponding meta-data (above).

We address the problem of optimizing queries over such databases. Assume
the following query (1):

select x from Part as x
where exists x.part.prop as p
      where p.meta.name = "color" and p.value = "red";



which asks for all parts with a direct subpart of red color. Assume now that we
know of the integrity constraints that (a) only car bodies (i.e., objects of
Part' with category = "car_body") can have the property "color",

{"car_body"} = select body.category from Part' as body
               where exists body.prop' as p
                     where p.name = "color"

and (b) parts that have direct subparts of category "car_body" are always of
category "car", thus

{"car"} = select car.category from Part' as car
          where exists car.part' as body
                where body.category = "car_body";

It is now easy to see that under constraints (a) and (b), (1) is equivalent to

select car from Part as car
where car.meta.category = "car"
and exists car.part as body
    where body.meta.category = "car_body"
    and exists body.prop as p
        where p.meta.name = "color" and p.value = "red";

Intuitively, what we have done to optimize (1) was to make use of the
restriction introduction heuristics of semantic query optimization. As an
alternative to the use of integrity constraints, one may take advantage of the
intuition that the amount of meta-data available is usually much smaller than
the amount of instance-level data. Consider the following query (2).

select x from Part' as x
where exists x.part'.prop' as p
      where p.name = "color";

It is easy to see that the meta-objects of the parts returned by our first query
(1) must be contained in the result of (2). We call (2) the description query of
(1). Since we may assume the meta-level database to be small, we may execute the
meta-level-only query (2) and then use the result for optimizing (1). For
instance, if our integrity constraints (a) and (b) hold, all objects in the
result of (2) will be of category "car", and we can again optimize our input
query by introducing this restriction. Moreover, if the result of (2) is empty,
(1) must return an empty result as well. □
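
The two-step strategy of Example 1.1 (evaluate the small meta-level query first, then use its result to restrict the instance-level query) can be illustrated with a minimal Python sketch. The dictionary encoding of objects, the sample oids, and the function names `description_query` and `described_query` are assumptions made for this illustration only; they are not part of the formalism of this paper.

```python
# Hypothetical in-memory encoding of a tiny layered database (illustration only).
# Meta-objects carry descriptive values; instance objects carry a "meta" link.
meta_parts = {
    "m_car":  {"category": "car",      "part": ["m_body"], "prop": []},
    "m_body": {"category": "car_body", "part": [],         "prop": ["m_color"]},
}
meta_props = {"m_color": {"name": "color"}}

parts = {
    "car1":  {"meta": "m_car",  "part": ["body1"], "prop": []},
    "body1": {"meta": "m_body", "part": [],        "prop": ["p1"]},
    "car2":  {"meta": "m_car",  "part": ["body2"], "prop": []},
    "body2": {"meta": "m_body", "part": [],        "prop": ["p2"]},
}
props = {
    "p1": {"meta": "m_color", "value": "red"},
    "p2": {"meta": "m_color", "value": "blue"},
}


def description_query():
    """Query (2): meta-parts with a direct sub-part that has a 'color' property."""
    return {
        x for x, mx in meta_parts.items()
        if any(meta_props[p]["name"] == "color"
               for sub in mx["part"] for p in meta_parts[sub]["prop"])
    }


def described_query(candidates=None):
    """Query (1): parts with a direct sub-part of red color.

    If `candidates` (a set of meta-objects) is given, parts whose meta-object
    is not a candidate are skipped without inspecting their sub-parts.
    """
    result = set()
    for x, px in parts.items():
        if candidates is not None and px["meta"] not in candidates:
            continue
        for sub in px["part"]:
            if any(meta_props[props[p]["meta"]]["name"] == "color"
                   and props[p]["value"] == "red"
                   for p in parts[sub]["prop"]):
                result.add(x)
    return result


candidates = description_query()
# An empty description result means the described query need not be run at all.
pruned = described_query(candidates) if candidates else set()
```

In this toy instance the pruned evaluation skips the two car-body objects entirely, since their meta-object is not in the result of the description query.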

The work presented in this paper requires that schemata and database
applications are designed to make use of our description semantics; indeed, we
make description a data model primitive. This is not an arbitrary decision:
these description semantics are very natural and are becoming increasingly
widespread, due to the popularity (and obvious merits) of object-oriented
analysis and design and the influence of [29] on its community, for flexibility
reasons, and motivated by recent standardization efforts [?] that follow the
approach of description advocated in this paper. Examples are large enterprise
repositories and engineering databases [?], meta-data initiatives motivated by
interoperability considerations [26], and database applications in which
schemata need to be dynamically changeable and extensible (leaving aside schema
evolution, however).

This leads us to semistructured databases. In this context, the closely related
mechanism of graph simulation is widely used for query optimization (e.g. [?])
as well as a weak and flexible mechanism for typing objects (consider e.g. data
guides [?] and graph schemata [?]).

Also, note that the creation of dedicated structures for representing meta-data
often results quite straightforwardly from a principled schema design process,
independently of the underlying database paradigm or any special aspiration
towards building a layered database.

The contributions of this paper are the following. 

• We provide - to the best of our knowledge - the first treatment of the meta- 
data semantics used by the object-oriented software engineering community 
in a database context. 

• We define a fragment of relational algebra which we call the described 
queries and which is associated to a class of description queries that can be 
obtained through a simple transformation. Given a described query, we can 
easily obtain its description query that will, when evaluated, return exactly 
all the meta-data that apply to the result of the described query. More 
exactly, there is a functional relationship between a tuple of a described 
query and the tuple that contains all the relevant meta-data. 

• We show a few equivalences that allow results from a description query to be
used to optimize the described query: notably, each restriction obtained for a
description query can directly be pushed down to the described query, and the
emptiness of the result of the description query entails that the described
query is empty, too. The rationale is that a meta-level database is usually
many orders of magnitude smaller than the corresponding instance-level
database, while it usually contains many of the database values that are
meaningful to users and are named in query conditions to restrict queries, and
which are thus relevant for optimizing queries. We go into the details of
applying our results within the framework of the Chase in the case of
conjunctive queries.

Our results are unostentatious yet elegant, permitting the application of
optimizations analogous to those known for path queries (e.g. pruning using
graph simulation [?]) to relational algebra-like query languages.²

• We approach the problem of optimizing regular path queries using a meta- 
level database in a semistructured data model. We present a new algorithm 
for optimizing path queries (with conditions) using our flavor of meta-data. 
In this context, we show that our meta-data semantics basically reduces to 
the well-known issue of graph simulation extended by description values. 
This extension allows for interesting new data management applications. 

This paper is structured as follows. First we provide some preliminaries
(Section 2). In Section 2.3, we discuss the meta-data semantics of the
object-oriented software engineering community. Section 3 defines described and
description queries and contains our results regarding query optimization in
the framework of relational algebra. We then present our work on optimizing
regular path queries with conditions in the semistructured database context. We
conclude with a discussion of our work and its implications.

2 Preliminaries 

In this section, we first define an object-based data model which also generalizes 
from semantic relation-based data models. Next, to prepare the definition of our 
description semantics, we define a general notion of binary relationships, which is 
needed to be able to deal with, say, many-to-many relationships in data models 
that need to split such relationships into several relations. Finally, we provide 
definitions of our description semantics. 

2.1 Data Model 

The following sets of atomic elements are countably infinite and pairwise
disjoint: a domain of constants $D = \{d_1, d_2, \ldots\}$, a set of object
identities (oids) $O = \{o_1, o_2, \ldots\}$, a set of class names
$\{P_1, P_2, \ldots\}$, a set of relation names $\{R_1, R_2, \ldots\}$, and a
set of attribute names $\{A_1, A_2, \ldots\}$. We use the notation
$\langle A_1 : \cdot, \ldots, A_k : \cdot \rangle$ for tuples formed by
$k \geq 1$ distinct attributes.

The set of o-values is the smallest set containing $D \cup O$ s.t. if
$v_1, \ldots, v_k$ (with $k \geq 0$) are o-values and $A_1, \ldots, A_k$ are
attributes, then $\langle A_1 : v_1, \ldots, A_k : v_k \rangle$

² The main reason why we use relational algebra (embedded into a semantic data
model) is its simplicity; however, our results are easily translatable to
algebraic object-oriented query languages.
and $\{v_1, \ldots, v_k\}$ are o-values as well. An o-value assignment for a
finite set $\mathbf{R}$ of relation names is a function $\rho$ mapping each
$R \in \mathbf{R}$ to a finite set of o-values. An oid assignment for a finite
set $\mathbf{P}$ of class names is a function $\pi$ mapping each
$P \in \mathbf{P}$ to a finite set of oids. $\pi$ is called disjoint if
$\pi(P_1) \cap \pi(P_2) = \emptyset$ for all $P_1, P_2 \in \mathbf{P}$ with
$P_1 \neq P_2$.

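The set of type expressions, $\mathit{types}(\mathbf{P})$, over a finite set of
class names $\mathbf{P}$ is given by the following BNF syntax

$\tau ::= D \mid P \mid \langle A_1 : \tau, \ldots, A_k : \tau \rangle \mid \{\tau\}$

where $\tau$ is a type expression, $P \in \mathbf{P}$, and $k \geq 1$.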

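A class hierarchy $(\mathbf{P}, T, \preceq)$ is a tuple of a set of class names
$\mathbf{P}$, a function $T : \mathbf{P} \rightarrow \mathit{types}(\mathbf{P})$,
and a partial order $\preceq$ over $\mathbf{P}$ for which we can define a
subtyping relationship $\leq$ over $\mathit{types}(\mathbf{P})$ as follows.
$\leq$ is the smallest partial order s.t.

1. $P_1 \preceq P_2$ implies $P_1 \leq P_2$

2. $\tau_1 \leq \tau_2$ implies $\{\tau_1\} \leq \{\tau_2\}$

3. if $\tau_{i,1} \leq \tau_{i,2}$ for each $i \in \{1, \ldots, n\}$, then
$\langle A_1 : \tau_{1,1}, \ldots, A_n : \tau_{n,1}, B_1 : \tau_{1,3}, \ldots, B_m : \tau_{m,3} \rangle \leq \langle A_1 : \tau_{1,2}, \ldots, A_n : \tau_{n,2} \rangle$

$(\mathbf{P}, T, \preceq)$ is called well-formed if for each pair
$P_1, P_2 \in \mathbf{P}$, $P_1 \preceq P_2$ implies $T(P_1) \leq T(P_2)$. For
each class $P \in \mathbf{P}$, we can define the extension $\pi^*$ based on the
disjoint extension $\pi$ as
$\pi^*(P) = \bigcup \{\pi(P') \mid P' \preceq P,\ P' \in \mathbf{P}\}$.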

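The semantics of types is defined as follows. For each type expression $\tau$,
its disjoint interpretation $[\![\tau]\!]_\pi$ is

1. $[\![D]\!]_\pi = D$

2. $[\![P]\!]_\pi = \pi^*(P)$ for each $P \in \mathbf{P}$

3. $[\![\langle A_1 : \tau_1, \ldots, A_k : \tau_k \rangle]\!]_\pi = \{\langle A_1 : v_1, \ldots, A_k : v_k \rangle \mid k \geq 1,\ v_i \in [\![\tau_i]\!]_\pi,\ i = 1, \ldots, k\}$

4. $[\![\{\tau\}]\!]_\pi = \{\{v_1, \ldots, v_k\} \mid v_i \in [\![\tau]\!]_\pi,\ i = 1, \ldots, k\}$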

A schema $S$ is a tuple $(\mathbf{R}, \mathbf{P}, T, \preceq)$ where
$\mathbf{R}$ is a finite set of relation names, $\mathbf{P}$ is a finite set of
class names, $T$ is a function from $\mathbf{R} \cup \mathbf{P}$ to
$\mathit{types}(\mathbf{P})$, and $(\mathbf{P}, T, \preceq)$ is a well-formed
class hierarchy. Of course, a relation is a set of tuples of atomic elements,
and the types of relation names in $\mathbf{R}$ are defined accordingly.

An instance $I$ of schema $(\mathbf{R}, \mathbf{P}, T, \preceq)$ is a triple
$(\rho, \pi, \nu)$, where $\rho$ is an o-value assignment for $\mathbf{R}$,
$\pi$ is a disjoint oid assignment for $\mathbf{P}$, and $\nu$ is a (total)
function from the set of oids in $I$ to o-values, such that
$\rho(R) \subseteq [\![T(R)]\!]_\pi$ for each $R \in \mathbf{R}$, and
$\nu(o) \in [\![T(P)]\!]_\pi$ for each $P \in \mathbf{P}$ and each
$o \in \pi(P)$. Each oid in $I$ must belong to exactly one $\pi(P)$.
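
2.2 Semantic Relationships

Let $\tau_0, \ldots, \tau_n \in \mathit{types}(\mathbf{P})$. A path expression
on type $\tau_0$ is an expression $\tau_0.A_1.\cdots.A_n$ such that for each
$0 \leq i \leq n-1$, $U_i = \langle \ldots, A_{i+1} : \tau_{i+1}, \ldots \rangle$
or $U_i = \langle \ldots, A_{i+1} : \{\tau_{i+1}\}, \ldots \rangle$, where
$U_i = T(\tau_i)$ if $\tau_i \in \mathbf{P}$ and $U_i = \tau_i$ otherwise.

The semantics of path expressions is given by their (binary) extension
relations. For a path expression $\tau_1.A$ of length one we have the extension
relation

$[\![\tau_1.A]\!] = \{(v_1, v_2) \mid v_1 \in [\![\tau_1]\!] \text{ and } v_2 \in [\![\tau_2]\!] \text{ and } f\}$

where the condition $f$ depends on the form of $\tau_1$:

• $\nu(v_1) = \langle \ldots, A : v_2, \ldots \rangle$ if $\tau_1 \in \mathbf{P}$, $T(\tau_1) = \langle \ldots, A : \tau_2, \ldots \rangle$

• $\nu(v_1) = \langle \ldots, A : X, \ldots \rangle$ and $v_2 \in X$ if $\tau_1 \in \mathbf{P}$, $T(\tau_1) = \langle \ldots, A : \{\tau_2\}, \ldots \rangle$

• $v_1 = \langle \ldots, A : v_2, \ldots \rangle$ if $\tau_1 \notin \mathbf{P}$, $T(\tau_1) = \langle \ldots, A : \tau_2, \ldots \rangle$

• $v_1 = \langle \ldots, A : X, \ldots \rangle$ and $v_2 \in X$ if $\tau_1 \notin \mathbf{P}$, $T(\tau_1) = \langle \ldots, A : \{\tau_2\}, \ldots \rangle$

The extension relation $[\![\tau_0.A_1.\cdots.A_n]\!]$ for path expressions of
length $n$ is defined as

$[\![\tau_0.A_1.\cdots.A_n]\!] = [\![\tau_0.A_1]\!] \circ \cdots \circ [\![\tau_{n-1}.A_n]\!]$

where $\circ$ denotes the composition of binary relations
($R_1 \circ R_2 = \{(v_1, v_3) \mid (v_1, v_2) \in R_1 \wedge (v_2, v_3) \in R_2\}$).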
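
The definition of the extension of a path of length n as a composition of one-step extensions can be sketched as follows. This is only an illustrative sketch: relations are modelled as Python sets of pairs, and the sample oids and the names `compose` and `path_extension` are assumptions, not notation from this paper.

```python
from functools import reduce


def compose(r1, r2):
    """Composition of binary relations:
    {(v1, v3) | (v1, v2) in r1 and (v2, v3) in r2}."""
    return {(v1, v3) for (v1, v2) in r1 for (w, v3) in r2 if v2 == w}


def path_extension(step_relations):
    """Extension of a path expression, given the extension relations of its
    individual steps in order (a non-empty list)."""
    return reduce(compose, step_relations)


# Hypothetical one-step extensions of Part.part and Part.prop.
part = {("car1", "body1"), ("car1", "engine1")}
prop = {("body1", "p_color"), ("engine1", "p_power")}

# Extension of the path expression Part.part.prop.
part_prop = path_extension([part, prop])
```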

Definition 2.1 A (binary) relationship $V$ between classes $P_1$ and $P_2$ may
be of two forms.

1. The extension relation of a path expression $P_1.A_1.\cdots.A_n$ from $P_1$
to $P_2$, i.e. $V \subseteq [\![P_1]\!]_\pi \times [\![P_2]\!]_\pi$.

2. Conjunctive views

$V(O_1, O_2) \leftarrow R_1(\bar{X}_1) \wedge \ldots \wedge R_k(\bar{X}_k) \wedge V_1(O_{1,1}, O_{1,2}) \wedge \ldots \wedge V_m(O_{m,1}, O_{m,2})$

where $R_1, \ldots, R_k \in \mathbf{R}$, $V_1, \ldots, V_m$ are binary relations
of the former type (i.e., defined by path expressions), and $O_1, O_2$ appear in
the $\bar{X}_i$ or $O_{i,j}$. We require the connectedness of the graph of the
view body and that joins over oids are typed. Conjunctive views may contain
constants from the domain of constants $D$ in relation columns typed $D$ in the
$\bar{X}_i$.

□ 

A simple (binary) relationship between classes $P_1$ and $P_2$ is either a
relation $R \in \mathbf{R}$ with $T(R) = \langle A_1 : P_1, A_2 : P_2 \rangle$
or a path expression of length one.

Example 2.2 Consider the schema of Example 1.1. We define four (simple)
relationships

part := Part.part        part' := Part'.part'
prop := Part.prop        prop' := Part'.prop'

□

The need for binary relationships in their general form is justified by the
intricacies that certain data models and schemata may offer.

Example 2.3 Consider again our car domain from Example 1.1.

• Assume that the assembly tree of parts stores information on when and by
whom a part was assembled. The type of the class Part could be

$T(\mathit{Part}) = \langle \mathit{meta} : \mathit{Part}',\ \mathit{assembly} : \{\langle \mathit{date} : D,\ \mathit{operator} : D,\ \mathit{part} : \mathit{Part} \rangle\},\ \mathit{prop} : \mathit{Property} \rangle$

The relationship $[\![\mathit{Part}.\mathit{assembly}.\mathit{part}]\!]$ covers
the part-subpart semantics of the part association of Example 1.1.

• A directed edge-labeled graph can be represented using a single node class
$P \in \mathbf{P}$ and a relation $R \in \mathbf{R}$ with
$T(R) = \langle P, D, P \rangle$. For instance, the fact that $o_2$ is a part'
of $o_1$ (in Example 1.1) is evidenced by $(o_1, \mathit{part}', o_2) \in R$. A
binary relationship part' is thus defined as a view (here, in datalog notation)

$\mathit{part}'(P_1, P_2) \leftarrow R(P_1, \mathit{part}', P_2).$

□



2.3 Description Semantics 

Definition 2.4 A description relation $P'\ \mathit{desc}\ P$ between classes
$P, P' \in \mathbf{P}$ satisfies the following requirements.

1. Its transitive closure is antisymmetric.

2. $\mathit{desc}$ is closed with respect to $\preceq$: If
$P'\ \mathit{desc}\ P$, then
$\{(P'_0, P_0) \mid P_0 \preceq P,\ P'_0 \preceq P'\} \subseteq \mathit{desc}$.

3. $\mathit{desc}$ is one-to-one up to inheritance: If $P'_1\ \mathit{desc}\ P$
and $P'_2\ \mathit{desc}\ P$, then $P'_1 \preceq P'_2$ or $P'_2 \preceq P'_1$.
If $P'\ \mathit{desc}\ P_1$ and $P'\ \mathit{desc}\ P_2$, then
$P_1 \preceq P_2$ or $P_2 \preceq P_1$.

We call a class $P$ described iff there is a class $P'$ s.t.
$P'\ \mathit{desc}\ P$. $P'$ is called (the) meta-class of $P$. We define a
function $\mu$ which maps each oid $o \in [\![P]\!]_\pi$ to an oid
$\mu(o) \in [\![P']\!]_\pi$, for all $P, P' \in \mathbf{P}$ s.t.
$P'\ \mathit{desc}\ P$. □

Next we adapt the notion of homomorphisms in object-oriented schemata of 

Definition 2.5 A homomorphism relation is a binary relation over binary
relationships. Let $V'\ \mathit{hom}\ V$ with
$V' \subseteq [\![P'_1]\!]_\pi \times [\![P'_2]\!]_\pi$ and
$V \subseteq [\![P_1]\!]_\pi \times [\![P_2]\!]_\pi$. Then,

1. $P'_1\ \mathit{desc}\ P_1$ and $P'_2\ \mathit{desc}\ P_2$ and

2. The transitive closure of $\mathit{hom}$ is antisymmetric.

3. $\mathit{hom}$ is one-to-one: If $V'\ \mathit{hom}\ V_1$ and
$V'\ \mathit{hom}\ V_2$, then $V_1 = V_2$. If $V'_1\ \mathit{hom}\ V$ and
$V'_2\ \mathit{hom}\ V$, then $V'_1 = V'_2$.

4. $(o_1, o_2) \in V$ implies $(\mu(o_1), \mu(o_2)) \in V'$ for all
$o_1 \in [\![P_1]\!]_\pi$ and $o_2 \in [\![P_2]\!]_\pi$.

□ 
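
Condition 4 of Definition 2.5 is the part of the homomorphism semantics that can be checked directly on an instance. A minimal sketch of such a check follows; relationships are modelled as sets of oid pairs and the function mu as a dictionary, and the function name `satisfies_homomorphism` and the sample oids are hypothetical.

```python
def satisfies_homomorphism(v, v_meta, mu):
    """Check condition 4: every instance-level link (o1, o2) in v has a
    describing link (mu(o1), mu(o2)) in the meta-level relationship v_meta."""
    return all((mu[o1], mu[o2]) in v_meta for (o1, o2) in v)


# Hypothetical instance: mu maps each object to its meta-object.
mu = {"car1": "m_car", "body1": "m_body", "engine1": "m_engine"}
part = {("car1", "body1"), ("car1", "engine1")}
part_meta = {("m_car", "m_body"), ("m_car", "m_engine")}

ok = satisfies_homomorphism(part, part_meta, mu)
# Dropping a describing link on the meta-level violates the condition.
violated = not satisfies_homomorphism(part, {("m_car", "m_body")}, mu)
```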

Relationships related by homomorphisms entail a natural layering of schemata
(which may be partial, however). Note that both $\mathit{desc}$ and
$\mathit{hom}$ can be defined to be transitive, which is practical if more than
one layer of description exists. However, then, the definition of $\mu$ becomes
somewhat problematic.

Definition 2.6 A schema with description is a tuple
$(\mathbf{R}, \mathbf{P}, T, \preceq, \mathit{desc}, \mathit{hom})$, where
$(\mathbf{R}, \mathbf{P}, T, \preceq)$ is a schema, $\mathit{desc}$ is a
description relation on $\mathbf{P}$, and $\mathit{hom}$ is a homomorphism
relation over $\mathbf{R}$, $\mathbf{P}$, and $\mathit{desc}$. An instance with
description is a tuple $(\rho, \pi, \nu, \mu)$, where $(\rho, \pi, \nu)$ is an
instance and $\mu$ is defined as above. □



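Example 2.7 Consider the schema of Example 1.1 and the relationships of
Example 2.2. For the database schema
$S = (\mathbf{R} = \emptyset, \mathbf{P}, T, \preceq, \mathit{desc}, \mathit{hom})$
of Example 1.1 we have

$\mathbf{P} = \{\mathit{Part}, \mathit{Property}, \mathit{Customer}, \mathit{Part}', \mathit{Property}'\}$

$T(\mathit{Part}) = \langle \mathit{part} : \{\mathit{Part}\},\ \mathit{prop} : \{\mathit{Property}\},\ \mathit{meta} : \mathit{Part}' \rangle$

$T(\mathit{Part}') = \langle \mathit{name} : D,\ \mathit{category} : D,\ \mathit{part}' : \{\mathit{Part}'\},\ \mathit{prop}' : \{\mathit{Property}'\} \rangle$

$T(\mathit{Property}) = \langle \mathit{meta} : \{\mathit{Property}'\},\ \mathit{value} : D \rangle$

$T(\mathit{Property}') = \langle \mathit{name} : D \rangle$

$\mathit{desc} = \{(\mathit{Part}', \mathit{Part}),\ (\mathit{Property}', \mathit{Property})\}$

$\mathit{hom} = \{(\mathit{part}', \mathit{part}),\ (\mathit{prop}', \mathit{prop})\}$

Inheritance is not used, thus $\preceq$ is the identity relation on
$\mathbf{P}$. The instance of Example 1.1 is shown in Figure 2. For example,

$\nu(o_1) = \langle \mathit{name} : \text{"01\_model\_Ml"},\ \mathit{category} : \text{"car"},\ \mathit{part}' : \{o_2, o_3, o_4\},\ \mathit{prop}' : \emptyset \rangle$

and $\rho = \emptyset$, $[\![\mathit{Part}']\!]_\pi = \{o_1, o_2, o_3, o_4\}$,
$[\![\mathit{Property}']\!]_\pi = \{o_5, o_6, o_7\}$. As the UML data model lacks
description semantics, we have represented description relationships by
mandatory associations "meta" from classes to meta-classes. We have
$\mu(o) = o.\mathit{meta}$ for all objects $o$ of described classes. □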

3 Described Queries

In this section, we present some of the main ideas of this paper. These ideas,
centering around the notions of described and description queries, are not so
much dependent on a particular data model (a notion of object or entity,
however, is required), and even less so on a particular query language. To make
our point as simple and intuitive as possible, we use a query language that is
based on well-known (positive) relational algebra. However, we make a few
modifications.

• Queries are based on classes and binary relationships (as presented earlier).
Let $P$ be a class. Then, $P$ is a query with the semantics
$\{\langle o \rangle \mid o \in [\![P]\!]_\pi\}$. (Thus, we obtain a set of
tuples of objects rather than a set of objects. This is a shortcut to simplify
the subsequent presentation.)

• All joins are of the form $Q_1 \bowtie R \bowtie Q_2$ (abbreviated
$Q_1 \bowtie_R Q_2$), where $R$ is a binary relationship, or $R \bowtie Q$
(abbreviated $\bowtie_R Q$), where both columns of $R$ are joined with $Q$.³

• Queries are typed in compliance with $\mathbf{P}$ and $\preceq$ (i.e., only
columns of the same type $\tau = P \in \mathbf{P}$ may be joined).

• The relational selection operation $\sigma$ tests attributes typed $D$ of the
objects occurring in query tuples rather than the objects themselves (see
Example 3.1).

• We define a new relational selection operator $\sigma'$ supporting our
intuition that the attributes of a meta-class are in some sense also owned by
the class described by it. The semantics of the operation is

$\sigma'_\gamma(P) = \{\langle o \rangle \mid o \in [\![P]\!]_\pi,\ \langle \mu(o) \rangle \in \sigma_\gamma(P')\}$,

where $P'\ \mathit{desc}\ P$, and $\gamma$ is a boolean selection condition
(using $\wedge$, $\vee$, $\neg$) over attributes of $P'$.

By \$i, we denote the i-th column of a query.

Example 3.1 Consider again query (1) of our running example. In our query
language, it is formulated as
$Q_1 = \pi_{\$1}(\sigma_{\$3.\mathit{value} = \text{"red"}}(Q'_1))$ with

$Q'_1 = (\mathit{Part} \bowtie_{\mathit{part}} \mathit{Part}) \bowtie_{\mathit{prop}(\$2,\$1)} \sigma'_{\mathit{name} = \text{"color"}}(\mathit{Property})$

The type of $Q_1$ is $\langle \mathit{Part} \rangle$ and
$T(Q'_1) = \langle \mathit{Part}, \mathit{Part}, \mathit{Property} \rangle$. □
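
The semantics of the selection operator σ' introduced above (an object qualifies iff its meta-object satisfies the condition) can be sketched in Python. The encoding of the class extension as a set of oids, of meta-objects as dictionaries, and the name `select_primed` are assumptions made for illustration only.

```python
def select_primed(extent, meta_values, mu, cond):
    """sigma'_gamma(P): unary tuples <o> for all objects o in the extent of P
    whose meta-object mu(o) satisfies the selection condition `cond`."""
    return {(o,) for o in extent if cond(meta_values[mu[o]])}


# Hypothetical extension of Property and values of the Property' meta-objects.
property_extent = {"p1", "p2", "p3"}
property_meta_values = {
    "m_color": {"name": "color"},
    "m_material": {"name": "material"},
}
mu = {"p1": "m_color", "p2": "m_color", "p3": "m_material"}

# sigma'_{name = "color"}(Property), as used in Example 3.1.
colored = select_primed(property_extent, property_meta_values, mu,
                        lambda m: m["name"] == "color")
```

Note that the condition is evaluated on the meta-object only; the instance-level attribute `value` is still tested by the ordinary selection σ.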

³ Thus, such joins also cover path expressions of object-oriented query
languages given appropriately defined relationships.

Definition 3.2 Let
$(\mathbf{R}, \mathbf{P}, T, \preceq, \mathit{desc}, \mathit{hom})$ be a layered
database schema. The set of described queries (DQ) is the smallest set that
satisfies

1. If $P'\ \mathit{desc}\ P$, then $P$ is a DQ.

2. If $Q_1$ and $Q_2$ are described queries with the types
$T(Q_1) = \langle \ldots, A_i : P_1, \ldots \rangle$ and
$T(Q_2) = \langle \ldots, B_j : P_2, \ldots \rangle$,
$R \subseteq [\![P_1]\!]_\pi \times [\![P_2]\!]_\pi$, and
$R'\ \mathit{hom}\ R$, then $(Q_1 \bowtie_{R(A_i, B_j)} Q_2)$ is a DQ.

3. If $Q$ is a DQ,
$T(Q) = \langle \ldots, A_i : P_1, \ldots, A_j : P_2, \ldots \rangle$,
$R \subseteq [\![P_1]\!]_\pi \times [\![P_2]\!]_\pi$, and
$R'\ \mathit{hom}\ R$, then $(\bowtie_{R(A_i, A_j)} Q)$ is a DQ.

4. If $Q$ is a DQ and $\gamma$ is a selection condition over a class in $M(Q)$,
then $\sigma'_\gamma(Q)$ is a DQ.

5. If $Q$ is a DQ with $T(Q) = \langle A_1 : P_1, \ldots, A_n : P_n \rangle$,
then $\pi_{A_{i_1}, \ldots, A_{i_m}}(Q)$ with
$i_1, \ldots, i_m \in \{1, \ldots, n\}$ is a DQ.

6. If $Q_1$ and $Q_2$ are described queries with $T(Q_1) = T(Q_2)$, then
$(Q_1 \cup Q_2)$ and $(Q_1 \cap Q_2)$ are described queries. □

The function $M$ translates (rewrites) a described query into its corresponding
query on the meta-level:

Definition 3.3 For each described query $Q$, the function $M$ maps $Q$ to its
so-called corresponding description query. $M$ is defined as

$M(P) = P'$

$M(Q_1 \bowtie_R Q_2) = M(Q_1) \bowtie_{R'} M(Q_2)$

$M(\bowtie_R (Q)) = \bowtie_{R'} (M(Q))$

$M(\sigma'_\gamma(Q)) = \sigma_\gamma(M(Q))$

$M(\pi_{A_{i_1}, \ldots, A_{i_m}}(Q_0)) = \pi_{A_{i_1}, \ldots, A_{i_m}}(M(Q_0))$

$M(Q_1 \cap Q_2) = M(Q_1) \cap M(Q_2)$

$M(Q_1 \cup Q_2) = M(Q_1) \cup M(Q_2)$

if $P'\ \mathit{desc}\ P$, $R'\ \mathit{hom}\ R$, and
$T(Q_0) = \langle A_1 : P_1, \ldots, A_n : P_n \rangle$,
$i_1, \ldots, i_m \in \{1, \ldots, n\}$, $P_1, \ldots, P_n \in \mathbf{P}$; for
$Q_1 \cap Q_2$ and $Q_1 \cup Q_2$: if $T(Q_1) = T(Q_2)$. □
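
Since M is a purely syntactic rewriting, it can be sketched as a recursive function over a query syntax tree. The following Python sketch is an illustration under assumptions: the node classes (`Cls`, `Join`, `SelfJoin`, `Select`, `Project`, `SetOp`) are a hypothetical encoding of the algebra of Definition 3.2, `desc` is given as a dictionary from class names to meta-class names, and `hom` as a dictionary from relationship names to describing relationship names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cls:                # a class name P
    name: str


@dataclass(frozen=True)
class Join:               # Q1 join_R Q2, optionally with joined columns
    left: object
    rel: str
    right: object
    cols: tuple = ()


@dataclass(frozen=True)
class SelfJoin:           # join_R (Q): both columns of R are joined with Q
    rel: str
    arg: object
    cols: tuple = ()


@dataclass(frozen=True)
class Select:             # sigma'_gamma(Q) if primed, sigma_gamma(Q) otherwise
    cond: str
    arg: object
    primed: bool = True


@dataclass(frozen=True)
class Project:            # projection on the listed columns
    cols: tuple
    arg: object


@dataclass(frozen=True)
class SetOp:              # op is "union" or "intersect"
    op: str
    left: object
    right: object


def M(q, desc, hom):
    """Rewrite a described query into its description query (Definition 3.3)."""
    if isinstance(q, Cls):
        return Cls(desc[q.name])
    if isinstance(q, Join):
        return Join(M(q.left, desc, hom), hom[q.rel], M(q.right, desc, hom), q.cols)
    if isinstance(q, SelfJoin):
        return SelfJoin(hom[q.rel], M(q.arg, desc, hom), q.cols)
    if isinstance(q, Select) and q.primed:
        # sigma' on the instance level becomes an ordinary sigma on the meta-level.
        return Select(q.cond, M(q.arg, desc, hom), primed=False)
    if isinstance(q, Project):
        return Project(q.cols, M(q.arg, desc, hom))
    if isinstance(q, SetOp) and q.op in ("union", "intersect"):
        return SetOp(q.op, M(q.left, desc, hom), M(q.right, desc, hom))
    raise TypeError(f"not a described query: {q!r}")


desc = {"Part": "Part'", "Property": "Property'"}
hom = {"part": "part'", "prop": "prop'"}

# Q'_1 of Example 3.1 in the hypothetical encoding.
q1_prime = Join(Join(Cls("Part"), "part", Cls("Part")), "prop",
                Select('name = "color"', Cls("Property")), cols=("$2", "$1"))
m_q1_prime = M(q1_prime, desc, hom)
```

The rewriting fails with a `TypeError` on anything outside the described fragment, for instance an ordinary selection σ on instance-level attributes or a set difference.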



Example 3.4 Query $Q'_1$ of the previous example is a described query. We have

$M(Q'_1) = (\mathit{Part}' \bowtie_{\mathit{part}'} \mathit{Part}') \bowtie_{\mathit{prop}'(\$2,\$1)} \sigma_{\mathit{name} = \text{"color"}}(\mathit{Property}')$

□

It is easy to see that $M$ is reversible, i.e., that we can use $M^{-1}$ to
translate queries "down" one meta-level.

We overload $\mu$ as follows. Let $Q$ be a described query and
$t = \langle o_1, \ldots, o_n \rangle$ be a tuple of $Q$. Then,
$\mu.t = \langle \mu(o_1), \ldots, \mu(o_n) \rangle$ (i.e., $\mu.t$ denotes the
element-wise application of $\mu$ to $t$) and
$\mu(Q) = \{\mu.t \mid t \in Q\}$. We can define an operator $\ltimes_\mu$
analogous to the relational semijoin operator s.t.
$Q \ltimes_\mu Q' = \{t \in Q \mid \mu.t \in Q'\}$ (where $Q$ must be a
described query). Because of the linearity of $\mu$, we have
$\mu(Q_1 \times Q_2) = \mu(Q_1) \times \mu(Q_2)$ and
$\mu(\pi_S(Q)) = \pi_S(\mu(Q))$ (where $S$ is any set of columns to be
projected).
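
The lifting of μ to tuples and queries and the semijoin-like operator can be sketched as follows, with query results modelled as Python sets of tuples of oids. The function names `mu_tuple`, `mu_query`, and `semijoin_mu` and the sample data are hypothetical.

```python
def mu_tuple(t, mu):
    """Element-wise application of mu to a tuple of oids (written mu.t above)."""
    return tuple(mu[o] for o in t)


def mu_query(q, mu):
    """mu(Q) = {mu.t | t in Q}."""
    return {mu_tuple(t, mu) for t in q}


def semijoin_mu(q, q_meta, mu):
    """The mu-semijoin: keep the tuples t of Q whose image mu.t is in Q'."""
    return {t for t in q if mu_tuple(t, mu) in q_meta}


mu = {"car1": "m_car", "car2": "m_car", "truck1": "m_truck", "body1": "m_body"}

# Linearity with respect to the cartesian product.
q1 = {("car1",), ("car2",)}
q2 = {("body1",)}
product = {a + b for a in q1 for b in q2}
lhs = mu_query(product, mu)
rhs = {a + b for a in mu_query(q1, mu) for b in mu_query(q2, mu)}

# Filtering an instance-level result by a meta-level result.
q = {("car1", "body1"), ("truck1", "body1")}
q_meta = {("m_car", "m_body")}
filtered = semijoin_mu(q, q_meta, mu)
```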

The important property of the description queries is that for each tuple in a
described query ($t \in Q$) there is a tuple $t' \in M(Q)$ that contains all of
its meta-data. This leads us to the following theorem.

Theorem 3.5 Let $Q$ be a described query. For all $t \in Q$,
$\mu.t \in M(Q)$. □

We can also write this as $\mu(Q) \subseteq M(Q)$.

Proof Sketch It is easy to show $\mu(Q) \subseteq M(Q)$ by induction on the size
of the described query. The induction starts with atomic queries of the form
$P \in \mathbf{P}$: Let $P'\ \mathit{desc}\ P$. By definition of $\mu$, we have
$\mu(P) \subseteq P' = M(P)$.

For the remaining steps, we assume that $\mu(Q) \subseteq M(Q)$ holds for
queries $Q$ ($Q_1$, $Q_2$, ...) of the next smaller size.

1. We split joins into the computation of the cartesian product (where
appropriate) and a selection operation $\bowtie_R$. Let
$M(Q_1 \times Q_2) = M(Q_1) \times M(Q_2)$. It is clear that
$\mu(Q_1 \times Q_2) = \mu(Q_1) \times \mu(Q_2) \subseteq M(Q_1) \times M(Q_2)$
because $\mu(Q_1) \subseteq M(Q_1)$ and $\mu(Q_2) \subseteq M(Q_2)$ by the
induction hypothesis. Furthermore,
$\mu(\bowtie_R (Q)) \subseteq \bowtie_{R'} (M(Q))$ if $R'\ \mathit{hom}\ R$ by
the definition of the homomorphism semantics (4): $(o_1, o_2) \in R$ implies
$\mu.\langle o_1, o_2 \rangle \in R'$.

2. Let $\gamma$ be over class $P$ only in $Q$. We can push the selection
$\sigma'_\gamma(Q)$ down the syntax tree of the query until we obtain a
subexpression $\sigma'_\gamma(P)$. However, by definition,
$\sigma'_\gamma(P) = P \ltimes_\mu \sigma_\gamma(P')$. Since
$\mu(\sigma'_\gamma(P)) = \mu(P \ltimes_\mu \sigma_\gamma(M(P))) \subseteq \sigma_\gamma(M(P)) = M(\sigma'_\gamma(P))$,
our claim holds.

3. Projection: our claim immediately follows from the premise
$\mu(Q) \subseteq M(Q)$ and the linearity of $\mu$.

4. Union: $\mu(Q_1 \cup Q_2) = \mu(Q_1) \cup \mu(Q_2)$. Because of
$\mu(Q_i) \subseteq M(Q_i)$,
$\mu(Q_1) \cup \mu(Q_2) \subseteq M(Q_1) \cup M(Q_2)$.

5. Intersection: as for union. □

Now, for described queries $Q$ and $\sigma'_\gamma(Q)$, we have
$Q = Q \ltimes_\mu M(Q)$ and
$\sigma'_\gamma(Q) = Q \ltimes_\mu \sigma_\gamma(M(Q))$, respectively.






Corollary 3.6 For a described query $Q$,

1. if $M(Q)$ is unsatisfiable or empty, $Q$ is unsatisfiable, and

2. if $M(Q) = \sigma_\gamma(M(Q))$, then $Q \equiv \sigma'_\gamma(Q)$.

□



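Note that Theorem 3.5 does not hold if we extend described queries by
$Q_1 \setminus Q_2$ (for described queries $Q_1$ and $Q_2$) with
$M(Q_1 \setminus Q_2) = M(Q_1) \setminus M(Q_2)$.

Example 3.7 Let $P'_1\ \mathit{desc}\ P_1$, $P'_2\ \mathit{desc}\ P_2$,
$R'\ \mathit{hom}\ R$, $\pi(P'_1) = \{o'_1\}$, $\pi(P'_2) = \{o'_2\}$,
$\pi(P_1) = \{o_{1,1}, o_{1,2}\}$, $\pi(P_2) = \{o_2\}$,
$R = \{\langle o_{1,2}, o_2 \rangle\}$, $R' = \{\langle o'_1, o'_2 \rangle\}$,
and $Q = P_1 \setminus \pi_{\$1}(P_1 \bowtie_R P_2)$. We have
$Q = \{\langle o_{1,1} \rangle\}$ and
$M(Q) = P'_1 \setminus \pi_{\$1}(P'_1 \bowtie_{R'} P'_2) = \emptyset$; thus,
$\mu(Q) \not\subseteq M(Q)$. □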
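
The counterexample of Example 3.7 is small enough to be executed directly. The following sketch encodes it with Python sets; the oid spellings (`o11` for $o_{1,1}$, `o1m` for $o'_1$, and so on) and the helper `join_first_column` are assumptions of this illustration.

```python
# mu maps both objects of P1 to the single meta-object of P1'.
mu = {"o11": "o1m", "o12": "o1m", "o2": "o2m"}

P1, P2 = {("o11",), ("o12",)}, {("o2",)}
P1m, P2m = {("o1m",)}, {("o2m",)}
R = {("o12", "o2")}        # instance-level relationship
Rm = {("o1m", "o2m")}      # describing relationship, R' hom R


def join_first_column(left, rel, right):
    """pi_$1(left join_rel right) for unary queries left and right."""
    return {(a,) for (a,) in left for (b,) in right if (a, b) in rel}


Q = P1 - join_first_column(P1, R, P2)          # the query with set difference
MQ = P1m - join_first_column(P1m, Rm, P2m)     # its would-be description query
mu_Q = {tuple(mu[o] for o in t) for t in Q}

# mu(Q) is not contained in M(Q): the theorem fails for set difference.
theorem_violated = not mu_Q <= MQ
```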



Let us now consider the semantic query optimization problem [14], [19], [?],
where we basically want to use semantics in the form of integrity constraints to
optimize queries.

A known problem with integrity constraints encoding data semantics (as opposed
to dependencies that encode schema semantics, such as foreign key constraints)
is that in real-world databases, such constraints are rarely available, because
providing them requires a fair amount of work and understanding of database
issues from users. For this reason, there has been work on automatically
deriving (mining) useful integrity constraints [32]. This of course requires
making a closed-world assumption for the data; integrity constraints have to be
changed when updates to the database occur. Now, the application of such
techniques to meta-level databases has special advantages; in particular,
meta-data change less frequently than instance-level data, and the size of the
meta-data may be many orders of magnitude smaller than the corresponding
instance-level data. Still, most of the conditions in practical queries are
actually based on descriptive values⁴ that are likely to be represented as
meta-data if appropriate means for description (as they are proposed in this
paper) are available.

Our results motivate a number of ways to optimize queries.
Our results motivate a number of ways to optimize queries. 

• Given a query $Q$, we can find subexpressions of $Q$ which are described
queries. For those, we compute the corresponding description queries and apply
semantic query optimization techniques to them, notably the restriction
introduction and unsatisfiability detection heuristics [14], [19], [?] (using
Corollary 3.6). The results found are then applied to the input query $Q$.

• Alternatively, one can simply execute description queries and apply the
results to original queries. The intuition here is that meta-data tend to be
much smaller than instance-level data, which makes such an approach practically
feasible.



⁴ We may expect users' memories to be focused on the essentials.






• A third alternative is to translate integrity constraints derived for the meta- 
level down to the instance level, basically by using the inverse of the M 
function. These constraints can subsequently be used to optimize queries 
using a conventional optimizer. 

Note that the first two optimization approaches were explained earlier in
Example 1.1. The algebraic approach followed in this section makes it easy to
implement our method in many algebraic query optimizers.
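As a rough illustration of the second approach (executing the description
query first and applying its result to the original query), consider the
following sketch. The toy data, the dictionary mu standing in for the
description relation, and the helper names description_query and
described_query are assumptions made for this illustration only; they are not
part of the formalism of this paper.

```python
# Hypothetical toy data: a small meta-level database of part descriptions.
meta_parts = {
    "m_body": {"category": "car_body", "props": {"color", "sliding_roof"}},
    "m_engine": {"category": "engine", "props": {"engine_id"}},
}

# Hypothetical description relation: instance-level part -> meta-level part.
mu = {"body_17": "m_body", "body_18": "m_body", "engine_4": "m_engine"}


def description_query(meta, prop_name):
    """Evaluate the condition on the (small) meta-level database only."""
    return {m for m, descr in meta.items() if prop_name in descr["props"]}


def described_query(mu, meta_hits):
    """Apply the meta-level result to the instance level via the inverse of mu."""
    return sorted(inst for inst, m in mu.items() if m in meta_hits)


meta_hits = description_query(meta_parts, "color")
result = described_query(mu, meta_hits)
```

The instance level is only touched once, and only for objects whose
descriptions already satisfy the condition.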

Conjunctive Queries 

Let us briefly consider the case of conjunctive described queries, which are
described queries without the union and set difference operations and where
selection conditions do not use the connectives ∨ or ¬. We use a standard
logical notation for conjunctive queries.

For conjunctive queries, we want to be able to incorporate our notions of
described and description queries into the successful Chase framework. This
allows us to apply previous results on optimality for many classes of
integrity constraints.

Indeed, this goal is easily achieved by merging (footnote 5) a described query
Q with its description query and optimizing the combined query Q ⋉_μ M(Q). To
such a combined query, we can then apply the Chase procedure with numerous
classes of integrity constraints. A natural class of constraints well suited
for semantic query optimization (and the automatic inference of constraints
from data) is that of implication integrity constraints [31]. Two examples of
such constraints are (a) and (b) in Example 1.1. The problems of optimizing
conjunctive queries using implication integrity constraints and their
incorporation into the Chase are well understood [31]. Informally, all we need
to do is to consider the atoms of the query body as a frozen database and to
interpret a set of implication integrity constraints like a datalog program on
this database.

5 To determine described subexpressions of an arbitrary conjunctive query Q, we
first need to rewrite Q into a query that uses the relationships (defined as
conjunctive views in Definition 2.1) for which we have defined homomorphisms.
This is the problem of rewriting queries using views, which is well understood
(e.g. [23]).

Example 3.8 Consider the query Q'_1 of our running example. Q := Q'_1 ⋉_μ
M(Q'_1) is a conjunctive query

\[
\begin{aligned}
Q(X_1) \leftarrow{} & \mathit{Part}(X_1),\ \mathit{part}(X_1, X_2),\ \mathit{Part}(X_2),\ \mathit{prop}(X_2, X_3),\ \mathit{Property}(X_3), \\
& \mathit{Part}'(X'_1),\ \mathit{part}'(X'_1, X'_2),\ \mathit{Part}'(X'_2),\ \mathit{prop}'(X'_2, X'_3),\ \mathit{Property}'(X'_3), \\
& \mu(X_1, X'_1),\ \mu(X_2, X'_2),\ \mu(X_3, X'_3),\ (X'_3.\mathit{name} = \texttt{"color"}).
\end{aligned}
\]

The integrity constraints (a) and (b) of Example 1.1 are implication integrity
constraints of the form

\[
\begin{aligned}
(X.\mathit{category} = \texttt{"car\_body"}) \leftarrow{} & \mathit{Part}'(X),\ \mathit{prop}'(X, Y),\ \mathit{Property}'(Y),\ (Y.\mathit{name} = \texttt{"color"}). \\
(X.\mathit{category} = \texttt{"car"}) \leftarrow{} & \mathit{Part}'(X),\ \mathit{part}'(X, Y),\ \mathit{Part}'(Y),\ (Y.\mathit{category} = \texttt{"car\_body"}).
\end{aligned}
\]

As pointed out in Example 1.1, we can optimize Q by first introducing the
restriction (X'_2.category = "car_body") using (a) and then
(X'_1.category = "car") using (b). □
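The idea of treating the query body as a frozen database and running the
implication constraints over it like a datalog program can be sketched as
follows. This is a naive forward-chaining illustration, not the Chase
implementation of our prototype; encoding a selection X.a = "v" as an atom
("attr", X, a, v), the "?"-prefix for rule variables, and all function names
are assumptions made for the sketch.

```python
def match(pattern, fact, binding):
    """Extend binding so that pattern equals fact, or return None."""
    if len(pattern) != len(fact):
        return None
    new_binding = dict(binding)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):  # rule variable
            if new_binding.setdefault(p, f) != f:
                return None
        elif p != f:  # predicate name or constant must agree
            return None
    return new_binding


def solutions(body, facts, binding=None):
    """Enumerate all bindings that satisfy every atom of a rule body."""
    binding = binding or {}
    if not body:
        yield binding
        return
    for fact in facts:
        extended = match(body[0], fact, binding)
        if extended is not None:
            yield from solutions(body[1:], facts, extended)


def chase(facts, rules):
    """Apply the rules to the frozen facts until nothing new is derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for b in list(solutions(body, facts)):
                derived = tuple(b.get(term, term) for term in head)
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


# Meta-level atoms of the query of Example 3.8, with variables frozen to constants.
frozen = {
    ("Part'", "X1'"), ("part'", "X1'", "X2'"), ("Part'", "X2'"),
    ("prop'", "X2'", "X3'"), ("Property'", "X3'"),
    ("attr", "X3'", "name", "color"),
}

rules = [
    # (a): X.category = "car_body" <- Part'(X), prop'(X,Y), Property'(Y), Y.name = "color"
    (("attr", "?x", "category", "car_body"),
     [("Part'", "?x"), ("prop'", "?x", "?y"), ("Property'", "?y"),
      ("attr", "?y", "name", "color")]),
    # (b): X.category = "car" <- Part'(X), part'(X,Y), Part'(Y), Y.category = "car_body"
    (("attr", "?x", "category", "car"),
     [("Part'", "?x"), ("part'", "?x", "?y"), ("Part'", "?y"),
      ("attr", "?y", "category", "car_body")]),
]

result = chase(frozen, rules)
```

The two derived atoms correspond to the restrictions introduced in Example 3.8.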

By considering Q ⋉_μ M(Q) rather than Q, we have introduced some redundancies
that may be eliminated at the end of the Chase (as global optimality is usually
understood to include a minimal number of joins). This is done by first
removing all meta-relationships simulating instance-level relationships present
in the query and then removing those occurrences of meta-classes that are not
made necessary by σ'-selections.

Note, however, that in practice seemingly redundant meta-level relationships
may help to reduce query execution costs. If, say, Q = P_1 ⋈_r P_2 and M(Q) is
much easier to evaluate than Q, the equivalent query
(P_1 ⋉_μ M(Q)) ⋈_r (P_2 ⋉_μ M(Q)) may be faster to execute than Q.



4 Optimizing Path Queries 

In this section, we apply the concepts introduced so far to a semistructured
database setting and assume data to be schema-less. To stay consistent with the
notions used so far, P consists of two classes, namely instance-level nodes and
meta-level nodes. There are, however, several relationships, distinguished by
labels. To avoid having to represent the hom relation explicitly, we assume
that relationships on the instance and meta-levels are under a homomorphism iff
their names are the same. Each instance-level node or edge is described.

In the following, let T be a set of tag names, A be a set of attribute names,
and the language ℒ_A be defined as the set of all attribute assignments a = s,
where a ∈ A and s is a string.

Definition 4.1 A graph database is a node-labeled rooted graph G = (V, r, lab, E),
where r is the only node in V whose in-degree is 0 and lab : V → T × 2^{ℒ_A} is
the labeling function. □

We use the abbreviation tag s.t. tag(v) = t iff ∃X : lab(v) = (t, X). Given a
graph database I = (V_I, r_I, lab_I, E_I), a graph database
M = (V_M, r_M, lab_M, E_M) is a meta-level database for I if we can define a
function μ : V_I → V_M s.t.

1. for each v ∈ V_I, tag_I(v) = tag_M(μ(v)) and

2. for each pair v_1, v_2 ∈ V_I, (v_1, v_2) ∈ E_I ⇒ (μ(v_1), μ(v_2)) ∈ E_M.

<db description="true">
  <part id="01_model_M1" category="car">
    <part id="suv_platform1" />
    <part id="150hp_suv_engine1"><prop name="engine_id" /></part>
    <part id="nostalgic_suv_body1" category="car_body">
      <prop name="color" />
      <prop name="sliding_roof" />
    </part>
  </part>
</db>

Figure 3: Meta-data of Example 1.1 as an XML file.

[Diagram of the meta-data automaton. Labels recoverable from the figure:
name="SUV_platform1", name="engine_id", name="150hp_SUV_engine", name="color",
category="car_body", name="nostalgic_SUV_body1", name="sliding_roof".]

Figure 4: Meta-data FSA for Example 1.1.
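The two conditions on μ can be checked mechanically. The following is a
minimal sketch; the dictionary-based encoding of graph databases (attribute
sets are omitted, since the conditions only concern tags and edges) and the
function name is_meta_level_for are assumptions of this illustration.

```python
def is_meta_level_for(inst, meta, mu):
    """Check conditions 1 and 2 for a candidate mapping mu.

    inst and meta are dicts with keys "tags" (node -> tag) and
    "edges" (set of (source, target) pairs).
    """
    # Condition 1: mu preserves tags.
    tags_ok = all(inst["tags"][v] == meta["tags"][mu[v]] for v in inst["tags"])
    # Condition 2: every instance-level edge is simulated on the meta-level.
    edges_ok = all((mu[a], mu[b]) in meta["edges"] for a, b in inst["edges"])
    return tags_ok and edges_ok


meta = {
    "tags": {"m0": "db", "m1": "part", "m2": "prop"},
    "edges": {("m0", "m1"), ("m1", "m2")},
}
inst = {
    "tags": {"i0": "db", "i1": "part", "i2": "part", "i3": "prop"},
    "edges": {("i0", "i1"), ("i0", "i2"), ("i1", "i3")},
}
mu = {"i0": "m0", "i1": "m1", "i2": "m1", "i3": "m2"}

ok = is_meta_level_for(inst, meta, mu)

# Adding a part-to-part edge that the meta-level does not simulate breaks condition 2.
inst_bad = {"tags": inst["tags"], "edges": inst["edges"] | {("i1", "i2")}}
bad = is_meta_level_for(inst_bad, meta, mu)
```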
We consider the following query language. 

Definition 4.2 (Query language). The abstract syntax of regular path queries
with nested conditions is defined by the grammar

    start: path
    path:  tag | path '.' path | path '|' path | '(' path ')*' | path '[' conds ']'
    conds: cond | cond 'and' conds
    cond:  path | attr '=' string

where "attr" is the set of attribute names, "tag" the set of HTML tag names, and
"string" the set of strings. ('.' and '|' are each associative.) We denote
this query language by ℒ_Q.

The semantics of our query language is that of classical regular path queries
extended by a fragment of XPath [33] conditions. Notably, a query used as a
condition is true iff it returns a nonempty result w.r.t. the current context
node.

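Let G = (V, r, lab, E) be a graph database. Let π, π_1, π_2 ∈ ℒ_Q denote queries
in our language and γ, γ_1, γ_2 ∈ ℒ_γ denote conditions. We define the semantic
functions E : ℒ_Q × V → 2^V for query expressions and C : ℒ_γ × V → {true, false}
for conditions.

\[
\begin{array}{rcl}
E[\![t]\!]\,n &=& \{\, n' \mid \mathit{lab}(n') = (t, X) \,\} \\[2pt]
E[\![\pi_1.\pi_2]\!]\,n &=& \{\, n''' \mid n'' \in E[\![\pi_1]\!]\,n \,\wedge\, n''' \in E[\![\pi_2]\!]\,n'' \,\wedge\, E(n'', n''') \,\} \\[2pt]
E[\![\pi_1 \mid \pi_2]\!]\,n &=& E[\![\pi_1]\!]\,n \,\cup\, E[\![\pi_2]\!]\,n \\[2pt]
E[\![\pi^*]\!]\,n &=& \{\, n' \mid (n, n') \in R_\pi^* \,\} \\[2pt]
E[\![\pi[\gamma]]\!]\,n &=& \{\, n' \mid n' \in E[\![\pi]\!]\,n \,\wedge\, C[\![\gamma]\!]\,n' = \mathit{true} \,\} \\[2pt]
C[\![\gamma_1 \wedge \gamma_2]\!]\,n &=& C[\![\gamma_1]\!]\,n \,\wedge\, C[\![\gamma_2]\!]\,n \\[2pt]
C[\![\pi]\!]\,n &=& |E[\![\pi]\!]\,n| > 0 \\[2pt]
C[\![a = s]\!]\,n &=& \left\{ \begin{array}{ll}
  \mathit{true} & \text{if } \mathit{lab}(n) = (t, X) \wedge (a, s) \in X \\
  \mathit{false} & \text{otherwise}
\end{array} \right.
\end{array}
\]

where R_π = {(n, n') | n' ∈ E⟦π⟧n} and R_π^* is the reflexive and transitive
closure of the relation R_π. The semantics of query Q is E⟦Q⟧r. □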



Regular path queries are a powerful and theoretically elegant formalism. The
extension by nested conditions is required to be able to make use of
(meta-level) values that are important to our approach. To improve readability,
we will underline queries and attribute selection expressions in the remainder
of this paper.

Definition 4.3 Given a graph database G = (V, r, lab, E), its (nondeterministic)
finite state automaton (FSA) A is defined as A = (Q = V, s = r, δ, F = Q), where

\[
\begin{array}{rcl}
\delta &=& \{\, (v_1, (t, \emptyset), v_2) \mid (v_1, v_2) \in E \,\wedge\, \mathit{lab}(v_2) = (t, C) \,\} \\
&\cup& \{\, (v, (\epsilon, C), v) \mid v \in V,\ \mathit{lab}(v) = (t, C) \,\}
\end{array}
\]

We denote A by FSA(G). □
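Definition 4.3 translates directly into code. The following minimal sketch
builds the transition relation for a small fragment of the meta-data of
Figure 3; the dictionary encoding of graph databases, the string "ε" as a
stand-in for the empty tag, and the function name fsa_of_graph are assumptions
of this illustration.

```python
EPS = "ε"  # stand-in for the tag of condition-carrying transitions


def fsa_of_graph(nodes, root, edges):
    """Return (states, start, delta, finals) for a graph database.

    nodes maps a node id to (tag, {attribute: value}); edges is a set of
    (source, target) pairs. Every state is final.
    """
    delta = set()
    for v1, v2 in edges:
        tag, _attrs = nodes[v2]
        # Following an edge reads the tag of the target node.
        delta.add((v1, (tag, frozenset()), v2))
    for v, (_tag, attrs) in nodes.items():
        # A self-loop carries the attribute assignments of the node.
        delta.add((v, (EPS, frozenset(attrs.items())), v))
    states = set(nodes)
    return states, root, delta, states


nodes = {
    "o0": ("db", {}),
    "o3": ("part", {"id": "150hp_suv_engine1"}),
    "o5": ("prop", {"name": "engine_id"}),
}
edges = {("o0", "o3"), ("o3", "o5")}

states, start, delta, finals = fsa_of_graph(nodes, "o0", edges)
```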



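Algorithm 4.4 Given a query Q ∈ ℒ_Q, its regular expression ⟦Q⟧ is computed
using the productions

\[
\begin{array}{ll}
[\![\pi_1.\pi_2]\!] \Rightarrow [\![\pi_1]\!].[\![\pi_2]\!] & [\![\pi_1 \mid \pi_2]\!] \Rightarrow [\![\pi_1]\!] \mid [\![\pi_2]\!] \\[2pt]
[\![\pi^*]\!] \Rightarrow [\![\pi]\!]^* & \\[2pt]
[\![t]\!] \Rightarrow (t, \emptyset) & [\![\pi[C]]\!] \Rightarrow [\![\pi]\!].(\epsilon, C)
\end{array}
\]

where π, π_1, π_2 are paths, t ∈ T, and C ∈ 2^{ℒ_Q ∪ ℒ_A}. That is, the regular
expression uses symbols of the alphabet Σ = (T ∪ {ε}) × 2^{ℒ_Q ∪ ℒ_A}. □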

In the following, FSA(Q) will translate a query Q into its associated FSA.
This is done by first computing a regular expression E from Q and then
computing an equivalent (λ-free; see footnote 6) nondeterministic FSA for E.
Furthermore, we will denote by RPQ(Q) the inverse operation of FSA(Q). Here, we
first translate the automaton to a regular expression and then reverse the
above transformation to merge labels back into the expression to obtain a
query. ε-transitions could be introduced more sparingly, but are needed for
queries of the form ⋯(path)*[cond].

Example 4.5 The query part*[prop[name="engine_id"]] is translated into the
regular expression

    (part, ∅)*.(ε, {prop[name="engine_id"]}),

which is again equivalent to the FSA Q = ({q_1, q_2}, q_1, δ, {q_2}) with

    δ = {(q_1, (part, ∅), q_1), (q_1, (ε, {prop[name="engine_id"]}), q_2)}

Observe that the translation of a query into its regular expression stops at
subexpressions (such as prop[name="engine_id"]) that have been moved into a
label; those are not further transformed at this point. □
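The translation of Example 4.5 can be reproduced with a few lines of code. In
this sketch, queries are encoded as nested tuples and the result is rendered
as a string; the tuple encoding, the function name to_regex and the
parenthesization policy are assumptions of the illustration. Nested conditions
are kept as opaque strings, in line with the observation that they are not
transformed further at this stage.

```python
def to_regex(q):
    """Render the regular expression of a query given as a nested tuple.

    Supported forms: ("tag", t), ("seq", p1, p2), ("alt", p1, p2),
    ("star", p) and ("cond", p, conds) with conds a tuple of strings.
    """
    kind = q[0]
    if kind == "tag":
        return f"({q[1]},∅)"
    if kind == "seq":
        return f"{to_regex(q[1])}.{to_regex(q[2])}"
    if kind == "alt":
        return f"({to_regex(q[1])}|{to_regex(q[2])})"
    if kind == "star":
        inner = to_regex(q[1])
        # A single symbol needs no extra parentheses.
        return f"{inner}*" if q[1][0] == "tag" else f"({inner})*"
    if kind == "cond":
        # Conditions move into the label of an ε-symbol appended to the path.
        conds = ",".join(q[2])
        return f"{to_regex(q[1])}.(ε,{{{conds}}})"
    raise ValueError(f"unknown query form: {kind}")


query = ("cond", ("star", ("tag", "part")), ('prop[name="engine_id"]',))
regex = to_regex(query)
```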



6 A common compositional technique for computing an equivalent FSA for a regular
expression needs to introduce dummy transitions, which are usually called
ε-transitions but are not to be confused with ours (thus, we refer to them as
λ-transitions). Our ε-transitions, which we have introduced to carry query
conditions, must not be eliminated at this point.

Algorithm (t_1, C_1) ∘_{(M = (Q_2, s_2, δ_2, F_2), q'_2)} (t_2, C_2)
returns l ∈ ((T ∪ {ε}) × 2^{ℒ_Q ∪ ℒ_A}) or ⊥:
    if (t_1 ≠ t_2) return ⊥;
    else if (t_1 = t_2 ≠ ε) return (t_1, C_1);
    else if (t_1 = t_2 = ε) {
        C' := ∅;
        for each a = s ∈ (C_1 ∩ ℒ_A) do {
            if (a = s' ∈ C_2 and s ≠ s') return ⊥;
            C' := C' ∪ {a = s};
        }
        for each query Q' ∈ (C_1 ∩ ℒ_Q) do {
            M' := (Q_2, q'_2, δ_2, F_2);
            P' := FSA(Q') ⋉ M';
            if (P' recognizes the empty language) return ⊥;
            else C' := C' ∪ {RPQ(P')};
        }
        return (ε, C');
    }

Figure 5: The ∘ algorithm

Definition 4.6 The binary operation Q ⋉ M on automata takes two
(nondeterministic) FSA (say, Q = (Q_1, s_1, δ_1, F_1) and
M = (Q_2, s_2, δ_2, F_2)) and computes a new nondeterministic FSA as follows.

\[
Q \ltimes M = (Q_1 \times Q_2,\ (s_1, s_2),\ \delta,\ F_1 \times F_2)
\]

with

\[
\delta = \{\, ((q_1, q_2),\ l_1 \circ_{(M, q'_2)} l_2,\ (q'_1, q'_2)) \mid (q_1, l_1, q'_1) \in \delta_1,\ (q_2, l_2, q'_2) \in \delta_2,\ l_1 \circ_{(M, q'_2)} l_2 \neq \bot \,\}
\]

Here, ∘ denotes the operation defined by the algorithm of Figure 5. □

Note that A ⋉ B would compute the product automaton A × B of two
nondeterministic FSA if l_1 ∘ l_2 returned l_1 in case l_1 = l_2 and ⊥
otherwise. A product automaton A × B recognizes the intersection of the
languages recognized by A and B; as such, product automata are very well suited
for restricting a query by the structure of a meta-level database.

It is easy to see that if a query Q does not contain conditions,
FSA(Q) ⋉ FSA(M) is actually equivalent to FSA(Q) × FSA(M). Otherwise, we
check for each condition in a query whether the meta-level database satisfies
it. This is decided as follows:

• for attribute assignments: everything depends on whether M contains some
attribute assignment for the same attribute at the current position. If so,
the assignments have to be the same. If such an assignment is missing in
M, we assume that the restriction applies to instance-level values only, and
we go on (footnote 7).

• for path queries Q': we compute an altered meta-level database FSA M'
whose start state is the one the current transition (of which we process the
label) leads to. The condition Q' is satisfied by M iff FSA(Q') ⋉ M' (thus,
⋉ is recursively defined) does not recognize the empty language.

There are various possibilities regarding which restrictions available in M (but
not required in Q) could be added to the query. For instance, in a
semistructured database in which description is made a primitive, we could add
the object identifiers of meta-objects as conditions to the query, and by
μ^{-1} we would immediately obtain the state extents of the instance-level
objects.

Theorem 4.7 Let M be a graph database and Q be a query. Then,
RPQ(FSA(Q) ⋉ FSA(M)) is equivalent to Q on all graph databases I simulated by
M. □

Example 4.8 Let Q be as defined in Example 4.5 and M be the automaton for
the meta-data of Example 1.1. Then,
Q ⋉ M = ({q_1, q_2} × {o_0, …, o_7}, (q_1, o_0), δ, {q_2} × {o_0, …, o_7}) with

\[
\begin{array}{rl}
\delta = \{ & ((q_1, o_0),\ (\mathit{part}, \emptyset),\ (q_1, o_1)), \\
& ((q_1, o_1),\ (\mathit{part}, \emptyset),\ (q_1, o_2)), \\
& ((q_1, o_1),\ (\mathit{part}, \emptyset),\ (q_1, o_3)), \\
& ((q_1, o_1),\ (\mathit{part}, \emptyset),\ (q_1, o_4)), \\
& ((q_1, o_3),\ (\epsilon, \{\texttt{prop[name="engine\_id"]}\}),\ (q_2, o_3)) \ \}
\end{array}
\]

By applying RPQ to Q ⋉ M, we obtain the pruned query

    part.part[prop[name="engine_id"]]

which is equivalent to our original query for all databases described by M.

We abbreviate prop[name="engine_id"] as X. To establish, say, the bottommost
of the transitions in δ, we had to evaluate (ε, {X}) ∘_{(M, o_3)} (ε, {X}) using
the algorithm of Figure 5. We obtain

    Q' = ({q_1, q_2, q_3}, q_1, {(q_1, (prop, ∅), q_2), (q_2, (ε, {[name="engine_id"]}), q_3)}, {q_3})

and M' is M with start state o_3. Of course, RPQ(Q' ⋉ M') = X and thus
(ε, {X}) ∘_{(M, o_3)} (ε, {X}) = (ε, {X}). □
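The computation of Example 4.8 can be replayed with a small sketch of the
operation of Definition 4.6 and the algorithm of Figure 5. Several
simplifications are assumptions of this illustration and not part of the
method: automata are triples (start, delta, finals), None stands for ε,
nested conditions are looked up by name in a dictionary of hand-built
automata, satisfied conditions are kept unchanged instead of being replaced by
RPQ(P'), the id attributes of the meta-data are omitted, and the assignment of
the state names o0 to o7 to the nodes of Figure 3 is our own choice.

```python
EPS = None  # stands for the tag ε


def fsa_of_graph(nodes, edges, root):
    """Definition 4.3 as (start, delta, finals); every state is final."""
    delta = {(v1, (nodes[v2][0], frozenset()), v2) for v1, v2 in edges}
    for v, (_tag, attrs) in nodes.items():
        conds = frozenset(("attr", a, s) for a, s in attrs.items())
        delta.add((v, (EPS, conds), v))
    return root, delta, set(nodes)


def is_empty(fsa):
    """True iff no final state is reachable from the start state."""
    start, delta, finals = fsa
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        for src, _label, dst in delta:
            if src == state and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return not (seen & finals)


def compose(l1, l2, meta, target, subqueries):
    """Sketch of the operation of Figure 5; None plays the role of ⊥."""
    (t1, c1), (t2, c2) = l1, l2
    if t1 != t2:
        return None
    if t1 is not EPS:
        return (t1, c1)
    meta_attrs = {a: s for _kind, a, s in c2}
    for cond in c1:
        if cond[0] == "attr":
            _kind, a, s = cond
            if a in meta_attrs and meta_attrs[a] != s:
                return None
        else:  # ("query", name): nested path condition
            _start, m_delta, m_finals = meta
            shifted = (target, m_delta, m_finals)  # M' starts at the target state
            if is_empty(product(subqueries[cond[1]], shifted, subqueries)):
                return None
    return (EPS, c1)  # simplification: conditions are kept as they are


def product(query, meta, subqueries):
    """Sketch of the operation of Definition 4.6."""
    (s1, d1, f1), (s2, d2, f2) = query, meta
    delta = set()
    for p1, l1, n1 in d1:
        for p2, l2, n2 in d2:
            label = compose(l1, l2, meta, n2, subqueries)
            if label is not None:
                delta.add(((p1, p2), label, (n1, n2)))
    return (s1, s2), delta, {(a, b) for a in f1 for b in f2}


nodes = {
    "o0": ("db", {}), "o1": ("part", {"category": "car"}),
    "o2": ("part", {}), "o3": ("part", {}),
    "o4": ("part", {"category": "car_body"}),
    "o5": ("prop", {"name": "engine_id"}),
    "o6": ("prop", {"name": "color"}),
    "o7": ("prop", {"name": "sliding_roof"}),
}
edges = [("o0", "o1"), ("o1", "o2"), ("o1", "o3"), ("o1", "o4"),
         ("o3", "o5"), ("o4", "o6"), ("o4", "o7")]
meta = fsa_of_graph(nodes, edges, "o0")

X = ("query", "X")  # abbreviates prop[name="engine_id"]
subqueries = {"X": ("p1", {
    ("p1", ("prop", frozenset()), "p2"),
    ("p2", (EPS, frozenset({("attr", "name", "engine_id")})), "p3"),
}, {"p3"})}
query = ("q1", {
    ("q1", ("part", frozenset()), "q1"),
    ("q1", (EPS, frozenset({X})), "q2"),
}, {"q2"})

start, delta, finals = product(query, meta, subqueries)
```

Under these assumptions, delta contains the five transitions listed in the
example; the condition-carrying transition survives only at the engine part.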

7 Using some kind of schema (such as an XML DTD) for M, such guesses could be
avoided, but here we assume M to be schema-less.

While the schema of Figure 1 needed to be cyclic (through the part'
association) to be able to represent trees (in our running example, aggregation
trees), the tree meta-data are acyclic. Although obvious, it is enjoyable to
observe that, consequently, all pruned path queries are certain to be star-free.

5 Conclusions 

A main goal of this paper was to propose description as a data model primitive, 
to justify this by examples and likely prospects, and to show it applicable to 
a wide range of data models beyond the semistructured. In fact, our notion 
of description is used in many practical large systems today, as it aids in their 
creation and maintenance. By including description into data model semantics, 
we may support a wide range of optimizations in an elegant way. 

Clearly, our semantics are first-order and can certainly be handled by the
most general of previous approaches to semantic query optimization. However,
these solve a problem too general to be efficiently manageable in practice.

In the case of semistructured data, our description semantics extends graph
schemata by values on the meta-level. While this may appear to be a minor
extension at first, we feel that it allows for a whole range of new
applications. Indeed, our meta-data are likely to be human-designed, shared
artifacts much like, say, XML DTDs, while graph schemata have traditionally
been conceived as automatically generated data hidden from humans. Here we
also envision interesting future work.

We have implemented prototype optimizers both for conjunctive queries in an
object-oriented data model and for semistructured path queries. Experiments
risk being misleading and are not reported in this paper. Clearly, since
we have the design authority over both (meta-level) schemata and description
meta-data, we can synthesize virtually any optimization speedup we like.

A demo version of the latter system, XDES, a Java framework for optimizing
path queries, can be accessed at

http://www.dbai.tuwien.ac.at/staff/koch/xdes/.

This framework is compatible with standard DOM parsers for reading meta-data
from XML. Queries to be optimized can be specified in one of several languages,
including a large fragment of XPath (here, we are compatible with the Apache
Xalan framework) and the language discussed in Section 4 of this paper. We plan
to support further languages such as XQuery in the future.